- RStudio is an integrated development environment (IDE)
- It provides a (much prettier) interface for the R software
- R is integrated into RStudio, so you never actually have to open R
RStudio lets you create projects: self-contained working spaces (i.e. working directories) that R refers to when looking for and saving files.
We’re going to create a new project in RStudio:
- File
- New Project
- Empty project
- Create project
This is one suggestion of how your R project can be organised. Let’s go ahead and create the folders.
- Enter
- Ctrl + Enter
- Run button at the top right of the source pane: runs the current line or selection
We’re going to work with a script. Let’s create one now and save it in the scripts directory.
- File
- New File
- R Script
An untitled script will appear in the source pane. Save it as intro-to-r.R using the floppy disk icon.
The console shows it is ready for new commands with a > sign. It will show a + sign if it still requires input for the command to be executed.
Sometimes you don’t know what is missing, or you change your mind and
want to run something else, or your code is running much too long and
you just want it to stop. To do so, hit Esc.
A great strength of R lies in packages: add-on sets of functions that are
built by the community. Once they pass a quality process, they
are available to download from a repository called CRAN. They need to be
explicitly activated. We will be using the tidyverse
package, which is actually a collection of useful packages. Another
package that will be useful for us is here.
If you have not installed these packages earlier, please do so.
You can check whether you have them installed in the Packages pane
in the bottom-right window.
install.packages('tidyverse')
install.packages('here')
You need to install a package only once, but you will need to load it
each time you want to use its functionality. To do that, use the
library() command:
library(tidyverse)
library(here)
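Because installation is needed only once, it can help to check what is already installed before calling install.packages(). The sketch below is one possible way to do that; the helper name missing_packages() is hypothetical and not part of R or any package.

```r
# Hypothetical helper (not part of R): returns the names in `pkgs`
# that cannot be found among the installed packages.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c('tidyverse', 'here'))
to_install

# If anything is listed, install it once:
# install.packages(to_install)
```

The install.packages() call is left commented out so that running the snippet never triggers a download by accident.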
Image credit: kaggle.com
You have created a project, which is your working directory, and a number of subfolders that will help you organise your project better. But now, each time you save or retrieve a file from those folders, you will need to specify the path from the folder you’re in (most likely scripts).
That can become complicated and can become a reproducibility problem if the person using your code (e.g. future you) is working in a different subfolder.
here() to the rescue! This package provides absolute
paths from the root (main directory) of your project.
Image credit: Allison Horst
here('data')
## [1] "C:/Users/awilczynski/Desktop/R Cafe/geospatial-data-carpentry-tud-2022-11/data"
download.file('https://raw.githubusercontent.com/datacarpentry/r-intro-geospatial/master/_episodes_rmd/data/gapminder_data.csv', here('data','gapminder_data.csv'), mode = 'wb')
1+100
## [1] 101
12/7
## [1] 1.714286
3*5
## [1] 15
We can store values in variables using the assignment operator
<-, like this:
x <- 1/40
Notice that assignment does not print a value. Instead, we stored it for later in something called a variable. x now contains the value 0.025:
x
## [1] 0.025
Look for the Environment tab in one of the panes of RStudio, and you
will see that x and its value have appeared. Our variable
x can be used in place of a number in any calculation that
expects a number, e.g. when calculating a square root:
sqrt(x)
## [1] 0.1581139
Variables can also be reassigned:
x <- 100
x
## [1] 100
You can also use the value stored in one variable to create another:
y <- x * 2 # you can use value stored in object x to create y
y
## [1] 200
So far we’ve looked at individual values. Now we will move to a data structure called a vector. Vectors are arrays of values of the same data type (explained in a moment).
You can create a vector with the c() function.
numeric_vector <- c(2, 6, 3) # vector of numbers - numeric data type.
numeric_vector
## [1] 2 6 3
character_vector <- c('banana', 'apple', 'orange') # vector of words - more precisely strings of characters- character data type
character_vector
## [1] "banana" "apple" "orange"
logical_vector <- c(TRUE, FALSE, TRUE) # vector of logical values (is something true or false?)- logical data type.
logical_vector
## [1] TRUE FALSE TRUE
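To make the "same data type" rule concrete, the short sketch below checks the type of a few vectors with class() and shows what happens when types are mixed. The name mixed_vector is only illustrative.

```r
class(c(2, 6, 3))              # "numeric"
class(c('banana', 'apple'))    # "character"
class(c(TRUE, FALSE))          # "logical"

# Mixing types in one vector: R converts every element to a single type
mixed_vector <- c(1, 'a', TRUE)
class(mixed_vector)            # "character"
mixed_vector                   # "1" "a" "TRUE"
```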
The combine function, c(), will also append things to an
existing vector:
ab_vector <- c('a', 'b')
ab_vector
## [1] "a" "b"
abcd_vector <- c(ab_vector, 'DC')
abcd_vector
## [1] "a" "b" "DC"
A common operation you want to perform is to remove all the missing
values (in R denoted as NA). Let’s have a look at how to do
it:
with_na <- c(1, 2, 1, 1, NA, 3, NA ) # vector including missing value
First, let’s try to calculate the mean of the values in this vector:
mean(with_na) # mean() function cannot interpret the missing values
## [1] NA
mean(with_na, na.rm = TRUE) # You can add the argument na.rm = TRUE to calculate the result while ignoring the missing values.
## [1] 1.6
However, sometimes you would like to have the NA values
completely removed from your vector. For this you need to identify which
elements of the vector hold missing values with the is.na()
function.
is.na(with_na) # This will produce a vector of logical values, stating if a statement 'This element of the vector is a missing value' is true or not
## [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE
!is.na(with_na) # The ! operator means negation, i.e. not is.na(with_na)
## [1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE
We know which elements in the vector are NA. Now we
need to retrieve the subset of the with_na vector that is
not NA. Any subsetting in R is done with
square brackets [ ].
without_na <- with_na[!is.na(with_na)] # this notation will return only the elements that have TRUE on their respective positions
without_na
## [1] 1 2 1 1 3
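The same logical-vector idea extends to other conditions. As a small sketch, the lines below count the missing values and combine two conditions inside the square brackets; the name larger_than_one is only illustrative.

```r
with_na <- c(1, 2, 1, 1, NA, 3, NA)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of NAs
sum(is.na(with_na))            # 2

# Keep elements that are not missing AND larger than 1
larger_than_one <- with_na[!is.na(with_na) & with_na > 1]
larger_than_one                # 2 3
```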
Another important data structure is called a factor. Factors look like character data, but are used to represent categorical information.
Factors create a structured relation between the different levels (values) of a categorical variable, such as days of the week or responses to a question in a survey. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.
Once created, factors can only contain a pre-defined set of values, known as levels.
nordic_str <- c('Norway', 'Sweden', 'Norway', 'Denmark', 'Sweden')
nordic_str # regular character vectors printed out
## [1] "Norway" "Sweden" "Norway" "Denmark" "Sweden"
nordic_cat <- factor(nordic_str) # factor() function converts a vector to factor data type
nordic_cat # With factors, R prints out additional information - 'Levels'
## [1] Norway Sweden Norway Denmark Sweden
## Levels: Denmark Norway Sweden
R will treat each unique value from a factor vector as a level and (silently) assign a numerical value to it. This comes in handy when performing statistical analysis. You can inspect and adapt the levels of a factor.
levels(nordic_cat) # returns all levels of a factor vector.
## [1] "Denmark" "Norway" "Sweden"
nlevels(nordic_cat) # returns number of levels in a vector
## [1] 3
Note that R sorts the levels in alphabetical order, not in the order of occurrence in the vector. R assigns the value 1 to level ‘Denmark’, 2 to ‘Norway’ and 3 to ‘Sweden’. This is important as it can affect e.g. the order in which categories are displayed in a plot or which category is taken as a baseline in a statistical model.
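The underlying integer codes can be inspected directly with as.integer(). The sketch below also illustrates why care is needed when a factor holds number-like text; number_fct is a made-up example, not part of the lesson data.

```r
nordic_cat <- factor(c('Norway', 'Sweden', 'Norway', 'Denmark', 'Sweden'))
as.integer(nordic_cat)                  # 2 3 2 1 3

# A factor whose labels look like numbers
number_fct <- factor(c('10', '20', '10'))
as.numeric(number_fct)                  # 1 2 1 (the level codes, not the labels)
as.numeric(as.character(number_fct))    # 10 20 10 (convert the labels first)
```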
You can reorder the categories using the factor()
function.
nordic_cat <- factor(nordic_cat, levels = c('Norway' , 'Denmark', 'Sweden')) # now Norway should be the first category, Denmark second and Sweden third
nordic_cat
## [1] Norway Sweden Norway Denmark Sweden
## Levels: Norway Denmark Sweden
str(nordic_cat) # you can also inspect vectors with the str() function. For factor vectors, it shows the underlying values of each category. You can also see the structure in the Environment tab of RStudio.
## Factor w/ 3 levels "Norway","Denmark",..: 1 3 1 2 3
There is more than one way to reorder factors. Later in the lesson,
we will use the fct_relevel() function from the
forcats package to do the reordering.
Remember that once created, factors can only contain a pre-defined
set of values, known as levels. It means that whenever you try to add
something to the factor vector outside of this set, it will become an
unknown/missing value, denoted by R as NA.
nordic_str
## [1] "Norway" "Sweden" "Norway" "Denmark" "Sweden"
nordic_cat2 <- factor(nordic_str, levels = c('Norway', 'Denmark'))
nordic_cat2 # since we have not included Sweden in the list of factor levels, it has become NA.
## [1] Norway <NA> Norway Denmark <NA>
## Levels: Norway Denmark
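The same thing happens when a value outside the levels is assigned to an existing factor. The sketch below shows this and one possible way around it, extending the levels first; nordic_fct and na_after_invalid are illustrative names.

```r
nordic_fct <- factor(c('Norway', 'Denmark', 'Norway'))

# 'Finland' is not a level, so the element becomes NA.
# Without suppressWarnings(), R also warns about an invalid factor level.
suppressWarnings(nordic_fct[2] <- 'Finland')
na_after_invalid <- is.na(nordic_fct)
na_after_invalid                 # FALSE TRUE FALSE

# Add the new level first, then the assignment works
levels(nordic_fct) <- c(levels(nordic_fct), 'Finland')
nordic_fct[2] <- 'Finland'
nordic_fct
```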
Now we turn to the bread and butter of working with R: tabular data. In R, tabular data are stored in a data structure called a data frame.
A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.
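In code, such a data frame can be sketched with the base R data.frame() function. The column names (amount, fruit, is_ripe) and values below are made up for illustration.

```r
# Three vectors of equal length become three columns
fruit_df <- data.frame(
  amount  = c(2, 6, 3),                      # numeric column
  fruit   = c('banana', 'apple', 'orange'),  # character column
  is_ripe = c(TRUE, FALSE, TRUE),            # logical column
  stringsAsFactors = FALSE                   # keep text as character in older R versions
)

str(fruit_df)    # one line per column: name, type, first values
nrow(fruit_df)   # 3
ncol(fruit_df)   # 3
```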
read_csv() is a function used to read comma-separated
data files (.csv format). There are other functions for
files separated with other delimiters. We’re going to read in the
gapminder data set, with information about countries’ size, GDP and average
life expectancy in different years.
gapminder <- read_csv(here('data','gapminder_data.csv') )
Let’s investigate the gapminder data frame a bit; the first thing we should always do is check out what the data looks like.
It is important to see if all the variables (columns) have the data type that we require. Otherwise we can run into trouble.
str(gapminder)
## spec_tbl_df [1,704 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ country : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : num [1:1704] 1952 1957 1962 1967 1972 ...
## $ pop : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. year = col_double(),
## .. pop = col_double(),
## .. continent = col_character(),
## .. lifeExp = col_double(),
## .. gdpPercap = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
We can see that the gapminder object is a data frame
with 1704 observations (rows) and 6 variables (columns). On each line after
a $ sign, we see the name of a column, its type and its
first few values.
There are multiple ways to explore a data set. Here are just a few examples:
head(gapminder) # see the first 6 rows of the data set
## # A tibble: 6 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Afghanistan 1957 9240934 Asia 30.3 821.
## 3 Afghanistan 1962 10267083 Asia 32.0 853.
## 4 Afghanistan 1967 11537966 Asia 34.0 836.
## 5 Afghanistan 1972 13079460 Asia 36.1 740.
## 6 Afghanistan 1977 14880372 Asia 38.4 786.
summary(gapminder) # gives basic statistical information about each column. The information format differs by data type.
## country year pop continent
## Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
## Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
## Mode :character Median :1980 Median :7.024e+06 Mode :character
## Mean :1980 Mean :2.960e+07
## 3rd Qu.:1993 3rd Qu.:1.959e+07
## Max. :2007 Max. :1.319e+09
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
When you’re analyzing a data set, you often need to access specific elements of it. There are different ways to go about it, and we will explore some of them.
One handy way to access a column is using its name and a dollar sign
$:
country_vec <- gapminder$country # Notation means: From dataset gapminder, give me column country. You can see that the column accessed in this way is just a vector of characters.
head(country_vec)
## [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
## [6] "Afghanistan"
Note that calling a column with a $ sign returns
a vector; it’s not a data frame anymore.
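The difference between getting a vector and getting a data frame can be sketched with base R subsetting on a small made-up data frame (fruit_df is only illustrative):

```r
fruit_df <- data.frame(
  fruit  = c('banana', 'apple'),
  amount = c(2, 6),
  stringsAsFactors = FALSE
)

fruit_df$fruit        # a character vector
fruit_df[['fruit']]   # the same vector, using double square brackets
fruit_df['fruit']     # single brackets keep it a data frame with one column
```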
Let’s start manipulating the data.
First we will adapt our dataset by keeping only the columns we’re
interested in, using the select() function from the
dplyr package:
year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)
## # A tibble: 6 × 3
## year country gdpPercap
## <dbl> <chr> <dbl>
## 1 1952 Afghanistan 779.
## 2 1957 Afghanistan 821.
## 3 1962 Afghanistan 853.
## 4 1967 Afghanistan 836.
## 5 1972 Afghanistan 740.
## 6 1977 Afghanistan 786.
Now, this is not the most common notation when working with the
dplyr package. dplyr offers an operator
%>% called a pipe, which allows you to build up very
complicated commands in a readable way.
In newer versions of R (4.1 and later) you can also find the notation
|> . This pipe works in much the same way; the main
difference is that you don’t need to load any packages to have it
available.
The select() statement with a pipe looks like
this:
year_country_gdp <- gapminder %>%
  select(year, country, gdpPercap)
head(year_country_gdp)
## # A tibble: 6 × 3
## year country gdpPercap
## <dbl> <chr> <dbl>
## 1 1952 Afghanistan 779.
## 2 1957 Afghanistan 821.
## 3 1962 Afghanistan 853.
## 4 1967 Afghanistan 836.
## 5 1972 Afghanistan 740.
## 6 1977 Afghanistan 786.
First we define the data set, then with the use of a pipe we pass it on to
the select() function. This way we can chain multiple
functions together, which we will be doing now.
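What a pipe does can be sketched without any data set: the value on the left becomes the first argument of the function on the right. The example assumes R 4.1 or later, because it uses the native |> pipe; the variable names are illustrative.

```r
values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# The piped version is read from left to right
piped <- values |> sqrt() |> sum()

nested   # 6
piped    # 6
```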
We already know how to select only the needed columns. But now we
also want to filter the data set by a certain condition with the
filter() function. Instead of doing it in separate steps, we
can do it all together. In the gapminder data set, we want
to see the results only for Europe in the 21st century.
year_country_gdp_euro <- gapminder %>%
  filter(continent == "Europe" & year > 2000) %>%
  select(year, country, gdpPercap)
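Conditions in filter() can be combined with & or passed as separate arguments divided by commas; both forms keep only the rows where every condition holds. A minimal sketch on a made-up data frame (toy is not part of the lesson data):

```r
library(dplyr)

toy <- data.frame(
  continent = c('Europe', 'Europe', 'Asia'),
  year      = c(1997, 2002, 2002),
  stringsAsFactors = FALSE
)

with_amp   <- toy %>% filter(continent == 'Europe' & year > 2000)
with_comma <- toy %>% filter(continent == 'Europe', year > 2000)

with_amp                          # one row: Europe, 2002
identical(with_amp, with_comma)   # both forms give the same result
```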
Challenge Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the African values for life expectancy, country and year, but not for other Continents. How many rows does your dataframe have and why?
So far, we have created a dataset for one of the continents
represented in the gapminder dataset. But rather than doing
that, we want to know statistics about all of the continents, presented
by group.
gapminder %>% # select the dataset
  group_by(continent) %>% # group by continent
  summarize(mean_gdpPercap = mean(gdpPercap)) # summarize() creates statistics for each group
## # A tibble: 5 × 2
## continent mean_gdpPercap
## <chr> <dbl>
## 1 Africa 2194.
## 2 Americas 7136.
## 3 Asia 7902.
## 4 Europe 14469.
## 5 Oceania 18622.
Challenge Calculate the average life expectancy per country. Which country has the longest average life expectancy and which has the shortest average life expectancy?
You can also group by multiple columns:
gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))
## # A tibble: 60 × 3
## # Groups: continent [5]
## continent year mean_gdpPercap
## <chr> <dbl> <dbl>
## 1 Africa 1952 1253.
## 2 Africa 1957 1385.
## 3 Africa 1962 1598.
## 4 Africa 1967 2050.
## 5 Africa 1972 2340.
## 6 Africa 1977 2586.
## 7 Africa 1982 2482.
## 8 Africa 1987 2283.
## 9 Africa 1992 2282.
## 10 Africa 1997 2379.
## # … with 50 more rows
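The "Groups: continent [5]" line in the output shows that the result is still grouped by continent: summarize() removes only the last grouping variable. If an ungrouped result is preferred, one option is the .groups argument of summarize(), sketched here on a made-up data frame (toy and its columns are illustrative):

```r
library(dplyr)

toy <- data.frame(
  continent = c('A', 'A', 'B', 'B'),
  year      = c(1, 2, 1, 2),
  gdp       = c(10, 20, 30, 50)
)

by_default <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdp = mean(gdp))

ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdp = mean(gdp), .groups = 'drop')

is_grouped_df(by_default)   # TRUE: still grouped by continent
is_grouped_df(ungrouped)    # FALSE: all grouping removed
```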
On top of this, you can also make multiple summaries of those groups:
gdp_pop_bycontinents_byyear <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    mean_gdpPercap = mean(gdpPercap),
    sd_gdpPercap = sd(gdpPercap),
    mean_pop = mean(pop),
    sd_pop = sd(pop)
  )
If you need the number of observations per group, you can use the
count() function:
gapminder %>%
  group_by(continent) %>%
  count()
## # A tibble: 5 × 2
## # Groups: continent [5]
## continent n
## <chr> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
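count() can also take the grouping column directly, which is a shorter way to get the same numbers as a group_by() followed by a summary with n(). A small sketch on made-up data (toy is illustrative):

```r
library(dplyr)

toy <- data.frame(
  continent = c('Africa', 'Asia', 'Africa'),
  stringsAsFactors = FALSE
)

# Shortcut: name the column inside count()
counts <- toy %>% count(continent)

# Longer equivalent with group_by() and summarize()
counts_long <- toy %>%
  group_by(continent) %>%
  summarize(n = n())

counts        # Africa 2, Asia 1
counts_long
```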
Frequently you’ll want to create new columns based on the values in
existing columns, for example to do unit conversions, or to find the
ratio of values in two columns. For this we’ll use
mutate().
gapminder_gdp <- gapminder %>%
  mutate(gdpBillion = gdpPercap * pop / 10^9)
head(gapminder_gdp)
## # A tibble: 6 × 7
## country year pop continent lifeExp gdpPercap gdpBillion
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779. 6.57
## 2 Afghanistan 1957 9240934 Asia 30.3 821. 7.59
## 3 Afghanistan 1962 10267083 Asia 32.0 853. 8.76
## 4 Afghanistan 1967 11537966 Asia 34.0 836. 9.65
## 5 Afghanistan 1972 13079460 Asia 36.1 740. 9.68
## 6 Afghanistan 1977 14880372 Asia 38.4 786. 11.7
Package ggplot2 is a powerful plotting system. I will
introduce the key features of ggplot. Later today / on Monday
you will use this package to visualize geospatial data. gg
stands for grammar of graphics, the idea that three components are needed to
create a graph:
- data
- aesthetics: the coordinate system on which we map the data (what is represented on the x axis, what on the y axis)
- geometries: the visual representation of the data (points, bars, etc.)
The fun part about ggplot is that you can then add
additional layers to the plot, providing more information and making it
more beautiful.
First, let’s plot the distribution of life expectancy in the
gapminder dataset.
gapminder %>% # data layer
  ggplot(aes(x = lifeExp)) + # aesthetics layer
  geom_histogram() # geometry layer
You can see that in ggplot you use + as a
pipe to add layers. Within a ggplot call, it is the only
pipe that will work. However, it is possible to chain operations on a
dataset with the pipe we have already learned, %>%
(or |>), and follow them with ggplot grammar.
Let’s create another plot, this time only on a subset of observations:
gapminder %>% # we select a dataset
  filter(year == 2007,
         continent == 'Americas') %>% # and filter it to keep only one year and one continent
  ggplot(aes(x = country, y = gdpPercap)) + # we create aesthetics, both x and y axis represent values of columns
  geom_col() # we select a column graph as a geometry
Now, you can iteratively improve how the plot looks. For example, you might want to flip it, to better display the labels.
gapminder %>%
  filter(year == 2007,
         continent == 'Americas') %>%
  ggplot(aes(x = country, y = gdpPercap)) +
  geom_col() +
  coord_flip()
One thing you might want to change here is the order in which the countries are displayed. It would be easier to compare GDP per capita if they were shown in order. To do that, we need to reorder the factor levels (remember, we’ve done this before). The order of the levels will depend on another variable: GDP per capita.
gapminder %>%
  filter(year == 2007,
         continent == 'Americas') %>%
  mutate(country = fct_reorder(country, gdpPercap)) %>%
  ggplot(aes(x = country, y = gdpPercap)) +
  geom_col() +
  coord_flip()
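What fct_reorder() does to the levels is easier to see on a tiny example. The sketch below uses made-up values (country and gdp here are illustrative vectors, not the gapminder columns) and passes a factor explicitly.

```r
library(forcats)

country <- factor(c('A', 'B', 'C'))
gdp     <- c(30, 10, 20)

levels(country)                                    # "A" "B" "C" (alphabetical)
levels(fct_reorder(country, gdp))                  # "B" "C" "A" (ascending gdp)
levels(fct_reorder(country, gdp, .desc = TRUE))    # "A" "C" "B" (descending gdp)
```

Only the order of the levels changes; the values in the vector stay where they are, which is why the plot rows move but the data do not.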
Let’s make things more colorful and represent the average life expectancy of a country by color:
gapminder %>%
  filter(year == 2007,
         continent == 'Americas') %>%
  mutate(country = fct_reorder(country, gdpPercap)) %>%
  ggplot(aes(x = country, y = gdpPercap, fill = lifeExp)) + # fill argument for coloring surfaces, color for points and lines
  geom_col() +
  coord_flip()
We can also adapt the color scale. A common choice, used because it is
colorblind-friendly, is the viridis palette.
gapminder %>%
  filter(year == 2007,
         continent == 'Americas') %>%
  mutate(country = fct_reorder(country, gdpPercap)) %>%
  ggplot(aes(x = country, y = gdpPercap, fill = lifeExp)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c() # _c stands for continuous scale
Maybe we don’t need that much information about the life expectancy. We only want to know if it’s below or above average.
plot_2007_amr <- # this time let's save the plot in an object
  gapminder %>%
  filter(year == 2007,
         continent == 'Americas') %>%
  mutate(country_reordered = fct_reorder(country, gdpPercap),
         lifeExpCat = if_else(lifeExp >= mean(lifeExp), 'high', 'low')) %>%
  ggplot(aes(x = country_reordered, y = gdpPercap, fill = lifeExpCat)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_d() # _d stands for discrete scale (_c is for continuous)
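The if_else() step used above can be tried on its own. It checks a condition for each element and returns the second argument where the condition is TRUE and the third where it is FALSE. A small sketch with made-up numbers (life_exp and life_exp_cat are illustrative names):

```r
library(dplyr)

life_exp <- c(70, 80, 75)
mean(life_exp)                                       # 75

life_exp_cat <- if_else(life_exp >= mean(life_exp), 'high', 'low')
life_exp_cat                                         # "low" "high" "high"
```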
Since we saved a plot as an object, nothing has been printed out.
Just like with any other object in R, if you want to see
it, you need to call it.
plot_2007_amr
Now we can make use of the saved object and add things to it.
Let’s also give it a title and name the axes:
plot_2007_amr <-
  plot_2007_amr +
  ggtitle('GDP per capita in Americas', subtitle = 'Year 2007') +
  xlab('Country') +
  ylab('GDP per capita')
plot_2007_amr
Once we are happy with our plot we can save it in a format of our choice. Remember to save it in the dedicated folder.
ggsave(here('fig_output','plot_2007_amr.pdf') ) # By default, ggsave() saves the last displayed plot, but you can also explicitly name the plot you want to save
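Naming the plot explicitly can be sketched as follows. The example uses a made-up plot (toy_plot) and a temporary file instead of the project folders, and the width and height values are arbitrary choices for illustration.

```r
library(ggplot2)

toy_plot <- ggplot(data.frame(x = c(1, 2, 3), y = c(2, 4, 6)),
                   aes(x = x, y = y)) +
  geom_point()

# A temporary file keeps the sketch independent of the project structure
out_file <- tempfile(fileext = '.pdf')

# plot = names the object to save; width, height and units set the size
ggsave(out_file, plot = toy_plot, width = 15, height = 10, units = 'cm')

file.exists(out_file)
```

Passing the plot by name makes the result independent of whichever plot happened to be displayed last.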
Another output of your work that you will want to save is a cleaned dataset. In your analysis, you can then load that dataset directly. Say we want to save the data only for Australia:
gapminder_aus <- gapminder %>%
  filter(country == 'Australia')
write_csv(gapminder_aus, here('data_output', 'gapminder_australia.csv'))